
Conversation

@pwilkin (Collaborator) commented Nov 17, 2025

See #16095 (comment) for the case.

@CISC (Collaborator) commented Nov 17, 2025

Add a testcase or it didn't happen. :)

@pwilkin (Collaborator, Author) commented Nov 17, 2025

Add a testcase or it didn't happen. :)

Look, children, that's what an evil maintainer looks like :P Will never let you off the hook on any PR, ever!

@pwilkin requested a review from slaren as a code owner on November 17, 2025, 22:29
@pwilkin (Collaborator, Author) commented Nov 17, 2025

update_cuda_graph_executable: CUDA graph update failed
ggml_backend_cuda_graph_compute: disabling CUDA graphs due to too many consecutive updates
update_cuda_graph_executable: CUDA graph update failed
  CONT(type=f32,ne=[10,10,10,1]): OK
  CONT(type=f32,ne=[2,1,1,1]): OK
  CONT(type=f32,ne=[2,1,3,5]): OK
  CONT(type=f32,ne=[2,3,5,7]): OK
  CONT(type=f16,ne=[2,1,1,1]): OK
  CONT(type=f16,ne=[2,1,3,5]): OK
  CONT(type=f16,ne=[2,3,5,7]): OK
  CONT(type=bf16,ne=[2,1,1,1]): OK
  CONT(type=bf16,ne=[2,1,3,5]): OK
  CONT(type=bf16,ne=[2,3,5,7]): OK
[CONT] NMSE = 0.447623183 > 0.000000100   CONT(type=f32,ne=[1,4,2,1]): FAIL
[CONT] NMSE = 2.241813873 > 0.000000100   CONT(type=f32,ne=[1,8,17,1]): FAIL
[CONT] NMSE = 0.058848433 > 0.000000100   CONT(type=bf16,ne=[1,4,2,1]): FAIL
[CONT] NMSE = 1.181509486 > 0.000000100   CONT(type=bf16,ne=[1,8,17,1]): FAIL
  10/14 tests passed

vs

  update_cuda_graph_executable: CUDA graph update failed
ggml_backend_cuda_graph_compute: disabling CUDA graphs due to too many consecutive updates
update_cuda_graph_executable: CUDA graph update failed
  CONT(type=f32,ne=[10,10,10,1]): OK
  CONT(type=f32,ne=[2,1,1,1]): OK
  CONT(type=f32,ne=[2,1,3,5]): OK
  CONT(type=f32,ne=[2,3,5,7]): OK
  CONT(type=f16,ne=[2,1,1,1]): OK
  CONT(type=f16,ne=[2,1,3,5]): OK
  CONT(type=f16,ne=[2,3,5,7]): OK
  CONT(type=bf16,ne=[2,1,1,1]): OK
  CONT(type=bf16,ne=[2,1,3,5]): OK
  CONT(type=bf16,ne=[2,3,5,7]): OK
  CONT(type=f32,ne=[1,4,2,1]): OK
  CONT(type=f32,ne=[1,8,17,1]): OK
  CONT(type=bf16,ne=[1,4,2,1]): OK
  CONT(type=bf16,ne=[1,8,17,1]): OK
  14/14 tests passed
  Backend CUDA0: OK

@CISC There :)

github-actions bot added the testing (everything test related), Nvidia GPU (issues specific to Nvidia GPUs), and ggml (changes relating to the ggml tensor library for machine learning) labels on Nov 18, 2025
@JohannesGaessler (Collaborator) left a comment

Thank you for the fix and sorry for not catching the bug during review; some functionality that would have covered this was removed and the logic was not adjusted.

Preferably, add an argument to the existing tests for GGML_OP_CONT rather than adding a new test case.

@pwilkin (Collaborator, Author) commented Nov 18, 2025

@JohannesGaessler okay, I integrated it into the CONT test and added a use_view_slice flag that decides whether a transpose or a view with a slice is used to achieve non-contiguity. Added a few more tests for completeness.
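Roughly the idea, as a minimal sketch (the helper name and the particular slice are assumptions, not the actual test-backend-ops.cpp code):

```cpp
// Sketch only: two ways a CONT test could produce a non-contiguous source.
// The helper name and the particular slice are made up for illustration.
#include "ggml.h"

static struct ggml_tensor * make_noncontiguous(
        struct ggml_context * ctx, struct ggml_tensor * src, bool use_view_slice) {
    if (use_view_slice) {
        // View that drops the first index along dim 1: the parent strides are kept,
        // so the view is non-contiguous and starts at a non-zero byte offset.
        return ggml_view_4d(ctx, src,
            src->ne[0], src->ne[1] - 1, src->ne[2], src->ne[3],
            src->nb[1], src->nb[2], src->nb[3],
            src->nb[1] /* byte offset: skip one row along dim 1 */);
    }
    // Classic route: a transpose also yields a non-contiguous tensor.
    return ggml_transpose(ctx, src);
}

// The op under test then simply copies the result back to contiguous memory:
//   struct ggml_tensor * out = ggml_cont(ctx, make_noncontiguous(ctx, src, use_view_slice));
```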

@JohannesGaessler (Collaborator) left a comment

Do you have the necessary permissions to hit the merge button or do I need to do it?

@pwilkin (Collaborator, Author) commented Nov 18, 2025

Do you have the necessary permissions to hit the merge button or do I need to do it?

Nope, don't have write access. But I'll address the Evil Maintainer's concerns first :D

@pwilkin (Collaborator, Author) commented Nov 18, 2025

Alright, this can be merged. But you know, @CISC, that now that you've given me this idea I'm going to run around and refactor all the tests to use loops? :P

@JohannesGaessler (Collaborator)

@pwilkin please do a more general refactor of tests in a dedicated PR.

@pwilkin (Collaborator, Author) commented Nov 18, 2025

@pwilkin please do a more general refactor of tests in a dedicated PR.

Of course, it was meant as a generalized threat/promise :) You can merge this one, I think (a lot of CI jobs will be failing due to the Cloudflare problems).

@pwilkin (Collaborator, Author) commented Nov 18, 2025

(and BTW, don't worry, I'm not opening a new refactoring till I'm done with the convert_hf_to_gguf.py one)

@CISC (Collaborator) commented Nov 18, 2025

(and BTW, don't worry, I'm not opening a new refactoring till I'm done with the convert_hf_to_gguf.py one)

Great, I'll keep you busy with changes there then. 👿

@JohannesGaessler (Collaborator)

Of course, it was meant as a generalized threat/promise :) You can merge this one, I think (a lot of CI jobs will be failing due to the Cloudflare problems).

The very first failed CI job I checked failed because the WebGPU backend crashes on the newly added tests. Please remove the tests from this PR, then we can merge only the fix for CUDA and make a new PR with the tests that we can merge once all backends work correctly.

@CISC (Collaborator) commented Nov 18, 2025

Of course, it was meant as a generalized threat/promise :) You can merge this one, I think (a lot of CI jobs will be failing due to the Cloudflare problems).

The very first failed CI job I checked failed because the WebGPU backend crashes on the newly added tests. Please remove the tests from this PR, then we can merge only the fix for CUDA and make a new PR with the tests that we can merge once all backends work correctly.

It doesn't actually fail on the test BTW; it fails because with F16 the view offset of the [1,4,2,1] tensor is not a multiple of 4. cc @reeselevine

@pwilkin (Collaborator, Author) commented Nov 18, 2025

@CISC @JohannesGaessler So wait, should I remove the F16 test case, or is it actually something that should be fixed in the WebGPU backend?

@CISC (Collaborator) commented Nov 18, 2025

@CISC @JohannesGaessler So wait, should I remove the F16 test case, or is it actually something that should be fixed in the WebGPU backend?

You can change it to a tensor where the view offset is a multiple of 4 bytes for now.

@pwilkin (Collaborator, Author) commented Nov 18, 2025

Okay, I've taken out the F16/BF16 tests with use_view_slice for now.

BTW, I think the problem might be more subtle, because the size of the {1, 4, 2, 1} tensor is a multiple of 4. What happens is that the calculated view-slice offset is possibly not a multiple of 4.
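
For concreteness, a tiny arithmetic sketch (the one-element view offset is an assumption, just to match the shapes above):

```cpp
// F16 elements are 2 bytes, so a view starting one element in sits at byte offset
// 1 * 2 = 2, which is not a multiple of 4; with F32 the same view starts at 1 * 4 = 4.
#include <cstddef>
#include <cstdio>

int main() {
    const std::size_t offset_elems = 1;  // assumption: the view starts one element in
    for (std::size_t elem_size : {std::size_t(2), std::size_t(4)}) {  // F16 = 2 bytes, F32 = 4 bytes
        const std::size_t byte_offset = offset_elems * elem_size;
        std::printf("element size %zu -> byte offset %zu (4-byte aligned: %s)\n",
                    elem_size, byte_offset, byte_offset % 4 == 0 ? "yes" : "no");
    }
    return 0;
}
```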

@CISC (Collaborator) commented Nov 18, 2025

Okay, I've taken out the F16/BF16 tests with use_view_slice for now.

BTW, I think the problem might be more subtle, because the size of the {1, 4, 2, 1} tensor is a multiple of 4. What happens is that the calculated view-slice offset is possibly not a multiple of 4.

Yeah, I edited my misleading comment. :)

@pwilkin (Collaborator, Author) commented Nov 18, 2025

@JohannesGaessler WebGPU succeeded this time, so I think it's safe to merge.

@reeselevine (Collaborator)

Of course, it was meant as a generalized threat/promise :) You can merge this one, I think (a lot of CI jobs will be failing due to the Cloudflare problems).

The very first failed CI job I checked failed because the WebGPU backend crashes on the newly added tests. Please remove the tests from this PR, then we can merge only the fix for CUDA and make a new PR with the tests that we can merge once all backends work correctly.

It doesn't actually fail on the test BTW; it fails because with F16 the view offset of the [1,4,2,1] tensor is not a multiple of 4. cc @reeselevine

Yep, looks like this is a bug in the WebGPU set_tensor function:
https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-webgpu/ggml-webgpu.cpp#L1500-L1518

WebGPU has more stringent alignment requirements for portability. This code handles the case where offset + size is not a multiple of 4, but not the case where the starting offset is not a multiple of 4!

Should be a relatively easy fix. I might prefer to fix it in the same PR that adds back the test cases that currently fail, so I am happy to work with whoever creates that PR on the fix.
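
For reference, a host-side sketch of that kind of handling (a std::vector stands in for the device buffer; function and variable names are made up, this is not the actual ggml-webgpu.cpp code):

```cpp
// Illustrative only: round the write region out to 4-byte boundaries and stage
// the bytes so the head/tail outside [offset, offset + size) are preserved.
// The real backend would issue a WriteBuffer at aligned_start instead of the
// final memcpy into the stand-in buffer.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

void set_region_aligned(std::vector<uint8_t> & buf, const uint8_t * data,
                        size_t offset, size_t size) {
    const size_t aligned_start = offset & ~size_t(3);                        // round start down to 4
    const size_t aligned_end   = std::min(buf.size(),
                                          (offset + size + 3) & ~size_t(3)); // round end up to 4
    std::vector<uint8_t> staging(buf.begin() + aligned_start, buf.begin() + aligned_end);
    std::memcpy(staging.data() + (offset - aligned_start), data, size);      // splice the new bytes in
    std::memcpy(buf.data() + aligned_start, staging.data(), staging.size()); // write back aligned region
}
```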

@JohannesGaessler (Collaborator)

@pwilkin my opinion is that we should merge the fix for CUDA sooner rather than later. That is the important part; the main purpose of the tests is to avoid the same breakage happening again later on. But if the addition of a test case breaks the tests themselves, we should put those into a different PR in the meantime. And we should definitely keep the problematic case for WebGPU rather than work around it.

@pwilkin (Collaborator, Author) commented Nov 18, 2025

@JohannesGaessler yeah, I kept the regression tests; I just added a skip (and a TODO) for the WebGPU cases.
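
Roughly the shape of that skip (hypothetical names and condition; only the continue-with-TODO pattern reflects what was described):

```cpp
// Hypothetical sketch: the flag name and skip condition are guesses, not the
// exact code added in the PR.
#include <cstdio>

int main() {
    const bool backend_is_webgpu = true;  // hypothetical flag, would come from the backend name
    for (bool use_view_slice : {false, true}) {
        // TODO: remove this skip once the WebGPU set_tensor offset alignment is fixed
        if (use_view_slice && backend_is_webgpu) {
            continue;
        }
        std::printf("adding CONT test case, use_view_slice=%d\n", (int) use_view_slice);
    }
    return 0;
}
```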

@pwilkin (Collaborator, Author) commented Nov 18, 2025

@reeselevine Once this gets merged you can just remove the continue block with the TODO comment for the failing tests.
